ENH: Add lazy copy for take and between_time #50476

phofl · 2022-12-28T22:26:56Z

xref ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jorisvandenbossche · 2023-01-03T07:59:18Z

pandas/core/generic.py

@@ -3779,6 +3779,8 @@ def _take(

        See the docstring of `take` for full explanation of the parameters.
        """
+        if axis == 0 and np.array_equal(indices, np.arange(0, len(self))):


Can we do a quick check of how costly this is compared to the actual take? (to ensure we don't introduce a performance regression for the other cases)

Maybe could also first check if the length of indices is equal to len(self)

> Maybe could also first check if the length of indices is equal to len(self)

np.array_equal actually already does that, so doing that ourselves is not needed

I profiled this: the cost is reduced as the DataFrame gets larger.

The actual problem is a bit different, though: array_equal has no early exit. Even if the first value is different, it checks the whole array (I wanted to bring this up tomorrow as well). We should probably write something ourselves, because I guess we will need it a couple of times.
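To illustrate the early-exit idea from the comment above, here is a minimal pure-Python sketch. The name `array_equal_early_exit` is hypothetical and is not the helper that ended up in pandas (a real implementation would presumably be compiled code working on NumPy arrays); it only shows the control flow: compare lengths first, then stop at the first mismatch.

```python
from typing import Sequence


def array_equal_early_exit(left: Sequence[int], right: Sequence[int]) -> bool:
    """Hypothetical sketch: element-wise equality that stops at the first difference."""
    # Cheap length check first, so mismatched sizes never enter the loop.
    if len(left) != len(right):
        return False
    # Walk both sequences in lockstep and bail out on the first mismatch
    # instead of comparing the remaining elements.
    for a, b in zip(left, right):
        if a != b:
            return False
    return True
```

With this shape, an indexer whose first element already differs costs one comparison, regardless of how long the arrays are.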

phofl · 2023-01-07T11:39:18Z

Switched to array_equal_fast here

pandas/core/series.py

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche · 2023-01-11T10:01:24Z

pandas/core/generic.py

+                and array_equal_fast(
+                    indices,
+                    np.arange(0, len(self), dtype=np.intp),


Further idea to potentially speed this up: right now we only use array_equal_fast to check an array against a standard 0...n indexer, I think?
For that specific case, we don't actually need to create this second array, but could just use the iteration variable inside array_equal_fast (it only needs the length).

(now, I don't know if we would want to start using array_equal_fast for other cases)

(that also avoids creating this additional array even for the fast cases where the length doesn't even match)
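The suggestion above could look roughly like the following pure-Python sketch. The function name `is_full_range_indexer` and its signature are assumptions made for illustration only; the point is that the loop counter stands in for the second array, so only the expected length `n` is needed.

```python
from typing import Sequence


def is_full_range_indexer(indices: Sequence[int], n: int) -> bool:
    """Hypothetical sketch: True if indices is exactly 0, 1, ..., n - 1."""
    # A length mismatch is rejected before any element is inspected,
    # and no np.arange-style helper array is ever allocated.
    if len(indices) != n:
        return False
    # Compare each value against its own position; exit on the first mismatch.
    for i, value in enumerate(indices):
        if value != i:
            return False
    return True
```

A caller would pass `len(self)` as `n`, for example `is_full_range_indexer(indices, len(obj))`, instead of building the reference indexer up front.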

I'll take a look at this in a follow-up, if that's OK, to check how big the performance improvement would be.

jorisvandenbossche · 2023-01-11T10:05:58Z

pandas/tests/copy_view/test_methods.py

@@ -481,6 +481,39 @@ def test_assign_drop_duplicates(using_copy_on_write, method):
    tm.assert_frame_equal(df, df_orig)


+@pytest.mark.parametrize("obj", [Series([1, 2]), DataFrame({"a": [1, 2]})])
+def test_take(using_copy_on_write, obj):
+    obj_orig = obj.copy()


Small nitpick: can you add a comment here that this is testing the corner case of taking all rows? (because in general take by definition always returns a copy)

Co-authored-by: Joris Van den Bossche <[email protected]>

ENH: Add lazy copy for take and between_time

7c5513c

jorisvandenbossche mentioned this pull request Jan 3, 2023

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed

73 tasks

jorisvandenbossche added the Copy / view semantics label Jan 3, 2023

jorisvandenbossche added this to the 2.0 milestone Jan 3, 2023

jorisvandenbossche reviewed Jan 3, 2023

phofl and others added 6 commits January 6, 2023 13:46

Merge branch 'main' into cow_take

a21a2f3

Merge remote-tracking branch 'upstream/main' into cow_take

7d48fea

Use array equal fast

700be46

Fix cond

697bb14

Fix condition

f694113

Use array_equal_fast

d66e5e3

jorisvandenbossche reviewed Jan 7, 2023

phofl and others added 2 commits January 7, 2023 15:41

Update pandas/core/series.py

258b1bc

Co-authored-by: Joris Van den Bossche <[email protected]>

Fix test

36287a3

jorisvandenbossche reviewed Jan 11, 2023

jorisvandenbossche approved these changes Jan 11, 2023

phofl and others added 3 commits January 11, 2023 13:47

Add comment

eabf2fa

Merge branch 'main' into cow_take

297a828

Remove import

64afeb2

jorisvandenbossche merged commit 62521da into pandas-dev:main Jan 13, 2023

phofl deleted the cow_take branch January 13, 2023 08:23

phofl added a commit to phofl/pandas that referenced this pull request Jan 13, 2023

ENH: Add lazy copy for take and between_time (pandas-dev#50476)

63052a6

Co-authored-by: Joris Van den Bossche <[email protected]>
